Parsing, Projecting & Prototypes: Repurposing Linguistic Data on the Web
Authors
Abstract
Until very recently, most NLP tasks (e.g., parsing and tagging) have been confined to a very limited number of languages, the so-called majority languages. Now, as the field moves into the era of developing tools for Resource Poor Languages (RPLs), a category that covers the vast majority of the world’s 7,000 languages, the discipline is confronted not only with the algorithmic challenges of limited data, but also with the sheer difficulty of locating data in the first place. In this demo, we present a resource that taps the large body of linguistically annotated data on the Web, data which can be repurposed for NLP tasks. Because the field of linguistics has as its mandate the study of human language (in fact, the study of all human languages) and has wholeheartedly embraced the Web as a means of disseminating linguistic knowledge, a large quantity of analyzed language data can be found on the Web. In many cases, the data is richly annotated and exists for many languages for which there would otherwise be very limited annotated data. The resource, the Online Database of INterlinear text (ODIN), makes this data available and provides additional annotation and structure, making it useful to the computational linguistics audience. In this paper, after a brief discussion of the previous work on ODIN, we report our recent work on extending ODIN by applying machine learning methods to the tasks of data extraction and language identification, and on using ODIN to “discover” linguistic knowledge. We then outline a plan for the demo presentation.
Similar resources
Language Processing for Spoken Dialogue Systems: Is Shallow Parsing Enough?
With maturing speech technology, spoken dialogue systems are increasingly moving from research prototypes to fielded systems. The fielded systems, however, generally employ much simpler linguistic and dialogue processing strategies than the research prototypes. We describe an implemented spoken-language dialogue system for a travel planning domain which supports a mixed initiative dialogue strate...
Ontological representations of rhetorical figures for argument mining
This paper surveys ontological modeling of rhetorical concepts, developed for use in argument mining and other applications of computational rhetoric, projecting their future directions. We include ontological models of argument schemes applying Rhetorical Structure Theory (RST); the RhetFig proposal for modeling; the related RetFig Ontology of Rhetorical Figures for Serbian (developed by two o...
Enriching Language Data through Projected Structures
This paper explores the potential for annotating and enriching data for minority or endangered languages via the alignment and projection of structure from annotated and parsed data for a resource-rich language such as English. The work presented here draws inspiration from the work of Yarowsky and Ngai (2001), who tested the methods for projecting linguistic annotations from one language to a...
Repurposing Theoretical Linguistic Data for Tool Development and Search
For the majority of the world’s languages, the number of linguistic resources (e.g., annotated corpora and parallel data) is very limited. Consequently, supervised methods, as well as many unsupervised methods, cannot be applied directly, leaving these languages largely untouched and unnoticed. In this paper, we describe the construction of a resource that taps the large body of linguistically ...
A Web-Based Instructional Platform For Constraint-Based Grammar Formalisms And Parsing
We propose the creation of a web-based training framework comprising a set of topics that revolve around the use of feature structures as the core data structure in linguistic theory, its formal foundations, and its use in syntactic processing.
Publication date: 2009